In May 2017, this started out as a demonstration that Scanpy would allow to reproduce most of Seurat's guided clustering tutorial (Satija et al., 2015).
We gratefully acknowledge Seurat's authors for the tutorial! In the meanwhile, we have added and removed a few pieces.
The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics (here from this webpage). On a unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.
# !mkdir data
# !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
# !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
# !mkdir write
#!pip3 install tabulate
#!pip3 install numpy
#!python3 -m pip install tabulate
Requirement already satisfied: tabulate in /usr/local/lib/python3.9/site-packages (0.8.9) Requirement already satisfied: numpy in /usr/local/lib/python3.9/site-packages (1.20.2)
import numpy as np
import pandas as pd
import scanpy as sc
import os
import sys
import umap
from tabulate import tabulate
sys.version #if anything doesn't load, check that this is in the same folder as the module you are trying to import
ERROR: Could not find a version that satisfies the requirement scanpy.plotting.anndata (from versions: none) ERROR: No matching distribution found for scanpy.plotting.anndata
'3.9.1 (default, Dec 17 2020, 10:08:12) \n[Clang 12.0.0 (clang-1200.0.32.27)]'
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header() #?throws an error, can troubleshoot if necessary
sc.settings.set_figure_params(dpi=80, facecolor='white')
scanpy==1.7.2 anndata==0.7.6 umap==0.5.1 numpy==1.20.2 scipy==1.6.2 pandas==1.2.4 scikit-learn==0.24.1 statsmodels==0.12.2
results_file = 'write/LDAVM02Python.h5ad' # the file that will store the analysis results; 'write' must be a folder
Read in the count matrix into an AnnData object, which holds many slots for annotations and different representations of the data. It also comes with its own HDF5 file format: .h5ad.
Define where the data folders are
folder = "rawdata/CellRanger" #outer folder containing all samples
datanames = os.listdir(folder) #sample folders
print(datanames)
datanames.remove(".DS_Store")
print(datanames)
innerfolder = "filtered_feature_bc_matrix" #folder inside each data folder containing the matrix files
['LD7RC', 'LD5LD', '.DS_Store', 'LD5RC', 'LD7LD'] ['LD7RC', 'LD5LD', 'LD5RC', 'LD7LD']
Make a loop that will look inside each data folder and read the matrix files
d = dict()
for sample in datanames:
samplename = str(sample)
newdata = sc.read_10x_mtx(
os.path.join(folder,sample,innerfolder), # the directory with the `.mtx` filevar_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True)
newdata.obs_names = newdata.obs_names + '-' + samplename
newdata.obs['Sample'] = sample
newdata.obs['Age'] = 'P' + sample[2] #this is specific to your naming strategy
if sample[4]=='D':
newdata.obs['Condition'] = 'Deprived'
if sample[4]=='C':
newdata.obs['Condition'] = 'Control'
d[samplename] = newdata.copy()
print(d)
... reading from cache file cache/rawdata-CellRanger-LD7RC-filtered_feature_bc_matrix-matrix.h5ad ... reading from cache file cache/rawdata-CellRanger-LD5LD-filtered_feature_bc_matrix-matrix.h5ad ... reading from cache file cache/rawdata-CellRanger-LD5RC-filtered_feature_bc_matrix-matrix.h5ad ... reading from cache file cache/rawdata-CellRanger-LD7LD-filtered_feature_bc_matrix-matrix.h5ad
{'LD7RC': AnnData object with n_obs × n_vars = 9265 × 31053
obs: 'Sample', 'Age', 'Condition'
var: 'gene_ids', 'feature_types', 'LD5LD': AnnData object with n_obs × n_vars = 9532 × 31053
obs: 'Sample', 'Age', 'Condition'
var: 'gene_ids', 'feature_types', 'LD5RC': AnnData object with n_obs × n_vars = 8607 × 31053
obs: 'Sample', 'Age', 'Condition'
var: 'gene_ids', 'feature_types', 'LD7LD': AnnData object with n_obs × n_vars = 7855 × 31053
obs: 'Sample', 'Age', 'Condition'
var: 'gene_ids', 'feature_types'}
adata = d[datanames[0]]
for sample in datanames[1:len(datanames)]:
adata = adata.concatenate(d[sample],index_unique = None)
adata
AnnData object with n_obs × n_vars = 35259 × 31053
obs: 'Sample', 'Age', 'Condition', 'batch'
var: 'gene_ids', 'feature_types'
print(adata.obs['Sample'].value_counts()) #this tells you how many cells are in each sample
print(adata.obs['Age'].value_counts())
print(adata.obs['Condition'].value_counts())
adata.obs['SampleDescription'] = adata.obs['Age'] + '_' + adata.obs['Condition']
print(adata.obs['SampleDescription'].value_counts())
LD5LD 9532 LD7RC 9265 LD5RC 8607 LD7LD 7855 Name: Sample, dtype: int64 P5 18139 P7 17120 Name: Age, dtype: int64 Control 17872 Deprived 17387 Name: Condition, dtype: int64 P5_Deprived 9532 P7_Control 9265 P5_Control 8607 P7_Deprived 7855 Name: SampleDescription, dtype: int64
adata.var_names_make_unique() # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
Show those genes that yield the highest fraction of counts in each single cell, across all cells.
sc.pl.highest_expr_genes(adata, n_top=20, )
normalizing counts per cell
finished (0:00:01)
Basic filtering:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=1)
filtered out 125 cells that have less than 200 genes expressed filtered out 7633 genes that are detected in less than 1 cells
Let's assemble some information about mitochondrial genes, which are important for quality control.
Citing from "Simple Single Cell" workflows (Lun, McCarthy & Marioni, 2017):
High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.
With pp.calculate_qc_metrics, we can compute many metrics very efficiently.
adata.var['mt'] = adata.var_names.str.startswith('mt-') # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
A violin plot of some of the computed quality measures:
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True)
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True,groupby = 'SampleDescription',rotation = 30) #rotation is degrees for x labels
... storing 'Sample' as categorical ... storing 'Age' as categorical ... storing 'Condition' as categorical ... storing 'SampleDescription' as categorical ... storing 'feature_types' as categorical
Remove cells that have too many mitochondrial genes expressed or too many total counts:
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt',color = 'SampleDescription',palette = ['grey','pink','blue','red'])
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts',color = 'SampleDescription')
Actually do the filtering by slicing the AnnData object.
adata = adata[adata.obs.n_genes_by_counts > 2500, :] #min features
adata = adata[adata.obs.n_genes_by_counts < 7500, :] #max features
adata = adata[adata.obs.pct_counts_mt < 7.5, :] #max pct mito
adata = adata[adata.obs.total_counts > 5000, :] #min counts
adata = adata[adata.obs.total_counts < 35000, :] #max counts
adata
View of AnnData object with n_obs × n_vars = 13060 × 23420
obs: 'Sample', 'Age', 'Condition', 'batch', 'SampleDescription', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
uns: 'SampleDescription_colors'
Total-count normalize (library-size correct) the data matrix $\mathbf{X}$ to 10,000 reads per cell, so that counts become comparable among cells.
sc.pp.normalize_total(adata, target_sum=1e4)
/usr/local/lib/python3.9/site-packages/scanpy/preprocessing/_normalization.py:138: UserWarning: Revieved a view of an AnnData. Making a copy.
view_to_actual(adata)
normalizing counts per cell
finished (0:00:00)
Logarithmize the data:
sc.pp.log1p(adata)
Identify highly-variable genes.
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
sc.pl.highly_variable_genes(adata)
Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
adata.raw = adata
Actually do the filtering
adata = adata[:, adata.var.highly_variable]
Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
regressing out ['total_counts', 'pct_counts_mt']
sparse input is densified and may lead to high memory use
finished (0:00:23)
Scale each gene to unit variance. Clip values exceeding standard deviation 10.
sc.pp.scale(adata, max_value=10)
Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.
sc.tl.pca(adata, svd_solver='arpack')
computing PCA
on highly variable genes
with n_comps=50
finished (0:00:02)
We can make a scatter plot in the PCA coordinates, but we will not use that later on.
sc.pl.pca(adata, color='SampleDescription')
Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function sc.tl.louvain() or tSNE sc.tl.tsne(). In our experience, often a rough estimate of the number of PCs does fine.
sc.pl.pca_variance_ratio(adata, log=True)
Save the result.
adata.write(results_file)
adata
AnnData object with n_obs × n_vars = 13060 × 2181
obs: 'Sample', 'Age', 'Condition', 'batch', 'SampleDescription', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'SampleDescription_colors', 'log1p', 'hvg', 'pca'
obsm: 'X_pca'
varm: 'PCs'
Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use default values here. For the sake of reproducing Seurat's results, let's take the following values.
sc.pp.neighbors(adata,n_neighbors = 15, n_pcs = 30, method = 'umap',use_rep = "X_pca",random_state = 1)
#sc.pp.neighbors(adata)
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:17)
We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. In some occasions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:
tl.paga(adata)
pl.paga(adata, plot=False) # remove `plot=False` if you want to see the coarse-grained graph
tl.umap(adata, init_pos='paga')
sc.tl.umap(adata)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:08)
sc.pl.umap(adata, color=['Aif1','Rbfox3','Sox9'])
As we set the .raw attribute of adata, the previous plots showed the "raw" (normalized, logarithmized, but uncorrected) gene expression. You can also plot the scaled and corrected gene expression by explicitly stating that you don't want to use .raw.
sc.pl.umap(adata, color=['Lyz2','Rbfox3','Sox9'], use_raw=False)
As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
import leidenalg
sc.tl.leiden(adata)
running Leiden clustering
finished: found 25 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:03)
Plot the clusters, which agree quite well with the result of Seurat.
sc.pl.umap(adata, color=['leiden', 'Ifitm3', 'Rbfox3','Sox9','Adgrb1'])
Save the result.
adata.write(results_file)
Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False)
ranking genes
finished (0:00:37)
sc.settings.verbosity = 2 # reduce the verbosity
The result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar. We recommend using the latter in publications, see e.g., Sonison & Robinson (2018). You might also consider much more powerful differential testing packages like MAST, limma, DESeq2 and, for python, the recent diffxpy.
Save the result.
adata.write(results_file)
As an alternative, let us rank genes using logistic regression. For instance, this has been suggested by Natranos et al. (2018). The essential difference is that here, we use a multi-variate appraoch whereas conventional differential tests are uni-variate. Clark et al. (2014) has more details.
Let us also define a list of marker genes for later reference.
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
Reload the object that has been save with the Wilcoxon Rank-Sum test result.
Show the 10 top ranked genes per cluster 0, 1, ..., 7 in a dataframe.
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mcm3 | Top2a | Hist1h2bc | Itm2b | Calr | Meg3 | P2ry12 | Ftl1 | Dtl | Rrm2 | ... | Top2a | Nfib | Slc1a3 | Pf4 | Scrg1 | Meg3 | Hist1h2ap | Ifitm3 | Iqgap1 | Mbp |
| 1 | Mcm6 | Mki67 | Arl6ip1 | Fcgr3 | Pfn1 | Gm42418 | Ltc4s | Ctsz | Uhrf1 | Atad2 | ... | Hist1h2ap | Sox11 | Ptn | Ms4a7 | Olig1 | Syt1 | Hist1h2ae | Rnf213 | S100a6 | Nfasc |
| 2 | Mcm4 | Hist1h2ae | Ccnb2 | Cst3 | Actb | Ptprd | Siglech | Mt1 | Lig1 | Pclaf | ... | Cenpf | Nfix | Atp1a2 | Dab2 | Olig2 | Ahi1 | Top2a | Bst2 | S100a11 | Bcas1 |
| 3 | Hells | Hist1h1e | Tubb4b | Selenop | Arhgdib | Ahi1 | Ccr5 | Ctsb | Atad2 | Hist1h1b | ... | Prc1 | Nedd4 | Slc1a2 | Stab1 | Sox8 | Map1b | Kif15 | Ly6e | Vim | Plp1 |
| 4 | Dtl | Smc4 | Gpx1 | Csf1r | Cotl1 | Pfkp | Qk | Cd63 | Hells | Esco2 | ... | Hist1h1b | Tuba1a | Plpp3 | Mrc1 | Ncald | Ank3 | Hjurp | Isg15 | Lgals3 | Bcan |
5 rows × 25 columns
Get a table with the scores and groups.
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
{group + '_' + key[:1]: result[key][group]
for group in groups for key in ['names', 'pvals']}).head(5)
| 0_n | 0_p | 1_n | 1_p | 2_n | 2_p | 3_n | 3_p | 4_n | 4_p | ... | 20_n | 20_p | 21_n | 21_p | 22_n | 22_p | 23_n | 23_p | 24_n | 24_p | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mcm3 | 1.544734e-308 | Top2a | 0.0 | Hist1h2bc | 0.000000e+00 | Itm2b | 0.000000e+00 | Calr | 7.503760e-118 | ... | Meg3 | 1.768991e-69 | Hist1h2ap | 8.174857e-47 | Ifitm3 | 1.903460e-43 | Iqgap1 | 2.990225e-38 | Mbp | 1.001906e-16 |
| 1 | Mcm6 | 3.525314e-262 | Mki67 | 0.0 | Arl6ip1 | 0.000000e+00 | Fcgr3 | 3.060458e-219 | Pfn1 | 2.089196e-98 | ... | Syt1 | 2.005751e-69 | Hist1h2ae | 2.032713e-46 | Rnf213 | 6.221992e-42 | S100a6 | 4.483125e-38 | Nfasc | 1.378656e-15 |
| 2 | Mcm4 | 1.080856e-206 | Hist1h2ae | 0.0 | Ccnb2 | 0.000000e+00 | Cst3 | 1.131150e-210 | Actb | 9.728325e-89 | ... | Ahi1 | 2.165159e-68 | Top2a | 7.883656e-45 | Bst2 | 2.124582e-39 | S100a11 | 6.034085e-38 | Bcas1 | 1.940994e-15 |
| 3 | Hells | 1.960542e-204 | Hist1h1e | 0.0 | Tubb4b | 1.416144e-283 | Selenop | 2.015644e-206 | Arhgdib | 2.134147e-86 | ... | Map1b | 2.507818e-68 | Kif15 | 3.029763e-44 | Ly6e | 9.228578e-39 | Vim | 3.709086e-37 | Plp1 | 2.231620e-15 |
| 4 | Dtl | 4.139074e-197 | Smc4 | 0.0 | Gpx1 | 3.104734e-256 | Csf1r | 1.761988e-190 | Cotl1 | 2.081825e-85 | ... | Ank3 | 5.586912e-68 | Hjurp | 4.507159e-44 | Isg15 | 1.192386e-37 | Lgals3 | 1.704403e-36 | Bcan | 3.089506e-15 |
5 rows × 50 columns
Compare to a single cluster:
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
ranking genes
finished (0:00:01)
If we want a more detailed view for a certain group, use sc.pl.rank_genes_groups_violin.
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
Reload the object with the computed differential expression (i.e. DE via a comparison with the rest of the groups):
#adata = sc.read(results_file)
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
If you want to compare a certain gene across groups, use the following.
sc.pl.violin(adata,['Aif1'],groupby = 'leiden',rotation = 30,save = "Aif1")
print(type(ax))
#ax.set_xlabel('test')
sc.pl.violin(adata, ['Olig2','Mbp'], groupby='leiden',rotation = 30)
sc.pl.violin(adata, ['Rbfox3','Sox9'], groupby='leiden',rotation = 30)
#low microglial markers are in clusters 16,17,19,20,23*;
#oligo markers are in 16,17,19,24;
#neuronal is 20, 21, 5;
#astro 16,17,19
#Definitely MG: 0,1,2,3,4,6,7,8,9,10,11,12,13,14,15
#Definitely not MG: 16,17,19
#Possibly oligo: 24
#Possibly neuronal: 20,21,5
WARNING: saving figure to file figures/violinAif1.pdf
<class 'matplotlib.axes._axes.Axes'>
Make a clustering tree
sc.tl.dendrogram(adata,groupby = 'leiden',optimal_ordering=True)
sc.pl.dendrogram(adata,groupby = 'leiden',orientation='left')
#low microglial markers are in clusters 16,17,19,20,23*;
#oligo markers are in 16,17,19,24;
#neuronal is 20, 21, 5;
#astro 16,17,19
#Definitely MG: 0,1,2,3,4,6,7,8,9,10,11,12,13,14,15
#Definitely not MG: 16,17,19,20
#Possibly oligo: 24
#Possibly neuronal: 20,21,5
#clustering with 16,17,19: 24,5,20; 21,15,1,9
#remove 16,17,19,20 as a first pass
using 'X_pca' with n_pcs = 50 Storing dendrogram info using `.uns['dendrogram_leiden']`
<AxesSubplot:>
#go back to raw dataset - start here before subsetting
adata.raw.to_adata()
adata = adata[adata.obs['total_counts']>1]
adata
View of AnnData object with n_obs × n_vars = 13060 × 2181
obs: 'Sample', 'Age', 'Condition', 'batch', 'SampleDescription', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'SampleDescription_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
adata_subset = adata[~adata.obs['leiden'].isin(['16','17','19','20','23'])] #~ reverses the boolean
adata_subset_new=sc.AnnData(adata_subset.raw.X,obs=adata_subset.obs,var=adata_subset.raw.var)
print(adata.obs['leiden'].value_counts())
print(adata_subset_new.obs['leiden'].value_counts())
0 1408 1 1259 2 1176 3 1149 4 1038 5 974 6 780 7 737 8 712 9 647 10 547 11 403 12 330 13 322 14 265 15 251 16 243 17 182 18 144 19 134 20 105 21 99 22 73 23 58 24 24 Name: leiden, dtype: int64 0 1408 1 1259 2 1176 3 1149 4 1038 5 974 6 780 7 737 8 712 9 647 10 547 11 403 12 330 13 322 14 265 15 251 18 144 21 99 22 73 24 24 Name: leiden, dtype: int64
adata = adata_subset_new
Actually mark the cell types.
sc.pl.umap(adata_subset, color='leiden', legend_loc='on data', title='', frameon=False, save='filtered.pdf')
WARNING: saving figure to file figures/umapfiltered.pdf
Renormalize and cluster
sc.pp.normalize_total(adata, target_sum=1e4)
normalizing counts per cell
finished (0:00:00)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata.raw = adata
extracting highly variable genes
finished (0:00:01)
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)
regressing out ['total_counts', 'pct_counts_mt']
sparse input is densified and may lead to high memory use
finished (0:00:26)
results_file = 'write/LDAVM02PythonFiltered.h5ad' # the file that will store the analysis results; 'write' must be a folder
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca(adata, color='SampleDescription',palette = ['grey','pink','blue','red'])
sc.pl.pca_variance_ratio(adata, log=True)
adata.write(results_file)
computing PCA
on highly variable genes
with n_comps=50
finished (0:00:02)
sc.pp.neighbors(adata,n_neighbors = 15, n_pcs = 30, method = 'umap',use_rep = "X_pca",random_state = 1)
sc.tl.umap(adata)
computing neighbors
finished (0:00:00)
computing UMAP
finished (0:00:08)
sc.pl.umap(adata, color=['Aif1','Rbfox3','Sox9','P2ry12','Cx3cr1'])
sc.tl.leiden(adata,resolution = 0.25)
running Leiden clustering
finished (0:00:02)
sc.pl.umap(adata, color=['leiden', 'Ifitm3', 'Rbfox3','Sox9','Adgrb1','Mbp','Mki67','Hist1h1e','pct_counts_mt'])
adata.write(results_file)
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False)
ranking genes
finished (0:00:28)
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| 0 | Lgmn | Top2a | Dtl | Meg3 | H2afz | Igf1 | Ifitm3 | Pf4 |
| 1 | Tyrobp | Prc1 | Mcm3 | Gm42418 | Tubb4b | Ctsb | Bst2 | Dab2 |
| 2 | C1qc | Mki67 | Mcm6 | Ahi1 | Hmgb1 | Ftl1 | Isg15 | Ms4a7 |
| 3 | Itm2b | Smc4 | Hells | Nrxn1 | Arl6ip1 | Ctsl | Rnf213 | F13a1 |
| 4 | C1qb | Kif11 | Mcm7 | Map1b | Ppia | Syngr1 | Oasl2 | Stab1 |
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
DElist = pd.DataFrame(
{group + '_' + key[:]: result[key][group]
for group in groups for key in ['names', 'pvals_adj','logfoldchanges']}) #lfc = log2fold change
print(DElist.head(5))
DElist.to_csv(
'./write/LDAVM02filtered_DEgenes.csv')
homeostatic-0_names homeostatic-0_pvals_adj homeostatic-0_logfoldchanges \ 0 Lgmn 0.0 0.143232 1 Tyrobp 0.0 0.184394 2 C1qc 0.0 0.130258 3 Itm2b 0.0 0.214588 4 C1qb 0.0 0.124799 pro-1_names pro-1_pvals_adj pro-1_logfoldchanges pro-2_names \ 0 Top2a 0.0 3.811081 Dtl 1 Prc1 0.0 4.751154 Mcm3 2 Mki67 0.0 3.513653 Mcm6 3 Smc4 0.0 2.986764 Hells 4 Kif11 0.0 4.394249 Mcm7 pro-2_pvals_adj pro-2_logfoldchanges neural-3_names ... \ 0 0.0 3.167744 Meg3 ... 1 0.0 1.689434 Gm42418 ... 2 0.0 2.235011 Ahi1 ... 3 0.0 2.843037 Nrxn1 ... 4 0.0 2.060964 Map1b ... pro-4_logfoldchanges immature-5_names immature-5_pvals_adj \ 0 0.682901 Igf1 1.109841e-230 1 1.890332 Ctsb 5.544451e-222 2 0.470314 Ftl1 7.072228e-221 3 0.633493 Ctsl 4.885533e-214 4 0.274581 Syngr1 1.593855e-204 immature-5_logfoldchanges ifn1-6_names ifn1-6_pvals_adj \ 0 4.363831 Ifitm3 8.914806e-135 1 0.577883 Bst2 4.960241e-134 2 0.491041 Isg15 2.059319e-133 3 0.622643 Rnf213 2.491599e-133 4 1.384471 Oasl2 1.368551e-120 ifn1-6_logfoldchanges macro-7_names macro-7_pvals_adj \ 0 4.771772 Pf4 1.355874e-84 1 1.545685 Dab2 1.649560e-83 2 5.642232 Ms4a7 1.524828e-81 3 3.545274 F13a1 2.348573e-81 4 5.494273 Stab1 1.035170e-78 macro-7_logfoldchanges 0 6.485342 1 2.095860 2 7.469332 3 8.719284 4 1.580288 [5 rows x 24 columns]
adata.uns['rank_genes_groups'].keys()
dict_keys(['params', 'names', 'scores', 'pvals', 'pvals_adj', 'logfoldchanges'])
sc.pl.violin(adata,['Aif1'],groupby = 'leiden',rotation = 30,save = "Aif1-filtered")
WARNING: saving figure to file figures/violinAif1-filtered.pdf
sc.tl.dendrogram(adata,groupby = 'leiden',optimal_ordering=True)
sc.pl.dendrogram(adata,groupby = 'leiden',orientation='left')
using 'X_pca' with n_pcs = 50 Storing dendrogram info using `.uns['dendrogram_leiden']`
<AxesSubplot:>
new_cluster_names = [
'homeostatic-0', 'pro-1',
'pro-2', 'neural-3',
'pro-4', 'immature-5',
'ifn1-6', 'macro-7']
adata.rename_categories('leiden', new_cluster_names)
marker_genes = ['Tgfbr1','Hexb','P2ry12','Tmem119','Mki67','Top2a','Nrxn1','Igf1','Spp1','Ifitm3','Pf4']
Now that we annotated the cell types, let us visualize the marker genes.
sc.pl.dotplot(adata, marker_genes, groupby='leiden');
There is also a very compact violin plot.
sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90);
During the course of this analysis, the AnnData accumlated the following annotations.
adata
AnnData object with n_obs × n_vars = 12338 × 1963
obs: 'Sample', 'Age', 'Condition', 'batch', 'SampleDescription', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'log1p', 'hvg', 'pca', 'SampleDescription_colors', 'neighbors', 'umap', 'leiden', 'leiden_colors', 'rank_genes_groups', 'dendrogram_leiden'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
adata.write(results_file, compression='gzip') # `compression='gzip'` saves disk space, but slows down writing and subsequent reading
Get a rough overview of the file using h5ls, which has many options - for more details see here. The file format might still be subject to further optimization in the future. All reading functions will remain backwards-compatible, though.
If you want to share this file with people who merely want to use it for visualization, a simple way to reduce the file size is by removing the dense scaled and corrected data matrix. The file still contains the raw data used in the visualizations in adata.raw.
adata.raw.to_adata().write('./write/LDAVM02filtered_withoutX.h5ad')
If you want to export to "csv", you have the following options:
# Export single fields of the annotation of observations
adata.obs[['SampleDescription', 'leiden']].to_csv(
'./write/pbmc3k_corrected_leiden_groups.csv')
# Export single columns of the multidimensional annotation
pd.DataFrame(adata.uns['rank_genes_groups']['names']).to_csv(
'./write/pbmc3k_DEgenes.csv')
adata.obsm.to_df()[['X_umap1', 'X_umap2']].to_csv(
'./write/pbmc3k_corrected_X_pca.csv')
# Or export everything except the data using `.write_csvs`.
#Set `skip_data=False` if you also want to export the data.
adata.write_csvs(results_file[:-5], )
writing .csv files to write/LDAVM02PythonFiltered